| Author | |
|---|---|
| Name | Claire Descombes |
| Affiliation | Universitätsklinik für Neurochirurgie, Inselspital Bern |
| Degree | MSc Statistics and Data Science, University of Bern |
| Contact | claire.descombes@insel.ch |
The reference material for this course, as well as some useful literature to deepen your knowledge of R, can be found at the bottom of the page.
R is free, open-source software for statistical computing and graphics. RStudio is a user-friendly environment for R, designed to make it more accessible. R can technically be used without RStudio (although I wouldn’t advise it), but RStudio cannot be used without R. To download both, follow the links below:
✏️ Download both programs.
Once you have downloaded both programs and opened RStudio, you will be presented with a window similar to the one shown in the following figure.
In the Console tab, we first see information about the version of R we are using and some basic commands to try out. At the end of these descriptions, we can type our R code, press Enter and see the result below the code line.
2+2
## [1] 4
The help() function and ? help operator in
R provide access to the documentation pages for R functions, data sets,
and other objects, both for packages in the standard R distribution and
for contributed packages.
You can access help directly from the console or via the Help tab in the bottom right-hand corner.
help(c)
# or equivalently
? c
However, code that we run directly in the console is not saved, so it cannot be reproduced later. If we need reproducible code to solve a specific task (and we usually do), we have to write it in a script file and save it regularly rather than typing it into the console.
To start recording a script, click File – New File – R Script. This will open a text editor in the top-left corner of the RStudio interface (above the Console tab, see following figure).
✏️ Create your own script. Feel free to take notes directly in it. You’ll use this script as a working document to complete various small tasks and exercises.
☑️ All exercises have an example solution at the end of the chapter.
💡 When your code starts to get long or complex, consider breaking it
down into separate scripts with clear and specific purposes — for
example: 1_data_import.R, 2_data_cleaning.R,
3_survival_analysis.R, 4_qol_analysis.R,
etc.
In R, everything is an object. This means that every piece of data you work with, from a single number to a complex dataset, is represented as an object with specific properties and behaviours. An object has attributes like class (data type) and dimensions.
Variables act as labels for objects. They are essentially pointers to the actual object stored in memory and appear in the Environment tab in RStudio.
Here’s an example to clarify the difference between variables and objects.
# We create an object (here: a vector) named 'vec' and assign a sequence of numbers to it.
vec <- 1:10
# 'vec' is the variable. The sequence of numbers (1, 2, 3, ..., 10) is the object.
💡 To assign a value to a variable, use the <- or
= operator.
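As a small illustration of the two assignment operators, here is a minimal sketch; the names `x`, `y` and `s` are placeholders chosen for this example. Many R style guides prefer `<-` for assignment and keep `=` for naming function arguments.

```r
# Both operators bind a value to a variable name
x <- 5
y = 5
identical(x, y)   # TRUE: both variables hold the same value

# Inside a function call, `=` names an argument instead of creating a variable
s <- seq(from = 1, to = 3)
s
```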
Before diving into data types and structures, it’s helpful to know how to inspect objects in R. Several built-in functions can help you understand the structure and content of an object.
Let’s define a simple data frame (more details about data frames below) to demonstrate the purpose of those functions.
# Example data frame
df <- data.frame(
ID = 1:5,
Name = c("Anna", "Ben", "Carla", "David", "Eva"),
Age = c(23, 31, 29, 40, 35)
)
Now let’s inspect this object using a few useful functions.
# General object inspection
typeof(df) # Returns the internal storage type of the object
## [1] "list"
str(df) # Gives a compact, human-readable summary of the object's structure
## 'data.frame': 5 obs. of 3 variables:
## $ ID : int 1 2 3 4 5
## $ Name: chr "Anna" "Ben" "Carla" "David" ...
## $ Age : num 23 31 29 40 35
attributes(df) # Lists the object's attributes (e.g., names, dimensions, class)
## $names
## [1] "ID" "Name" "Age"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5
str(attributes(df)) # Displays the structure of the attributes
## List of 3
## $ names : chr [1:3] "ID" "Name" "Age"
## $ class : chr "data.frame"
## $ row.names: int [1:5] 1 2 3 4 5
class(df) # Returns the class of the object (e.g., data.frame)
## [1] "data.frame"
# Functions especially useful for matrices or data frames
nrow(df) # Number of rows
## [1] 5
ncol(df) # Number of columns
## [1] 3
dim(df) # Dimensions (rows, columns)
## [1] 5 3
colnames(df) # Column names
## [1] "ID" "Name" "Age"
rownames(df) # Row names
## [1] "1" "2" "3" "4" "5"
str(colnames(df)) # Structure of the column names (e.g., character vector)
## chr [1:3] "ID" "Name" "Age"
str(rownames(df)) # Structure of the row names (e.g., character vector)
## chr [1:5] "1" "2" "3" "4" "5"
💡 The function str() provides a compact view of the
internal structure of an R object, helping you understand its components
and data types quickly.
In R, data types define the kind of information a variable can hold. Here are some of the most common data types:
typeof(3.14)
## [1] "double"
str(3.14)
## num 3.14
typeof(2L)
## [1] "integer"
str(2L)
## int 2
typeof(TRUE)
## [1] "logical"
str(TRUE)
## logi TRUE
# Example of a logical operation
values <- 1:10
above_five <- (values > 5)
above_five
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
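A logical vector such as `above_five` can be counted, summarised, or used to pick elements. The following sketch reuses the example above; nothing new is assumed beyond base R.

```r
values <- 1:10
above_five <- values > 5

sum(above_five)      # number of TRUE values (TRUE counts as 1)
any(above_five)      # is at least one value above 5?
all(above_five)      # are all values above 5?
values[above_five]   # keep only the values where the condition is TRUE
```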
✏️ Exercise on Booleans: Given a vector of ages
(ages <- c(35, 45, 60, 15, 50, 8)), determine which
patients are eligible for a treatment (age above 18). Return a Boolean
vector indicating whether each patient meets the age criteria.
typeof("hello")
## [1] "character"
str("hello")
## chr "hello"
typeof(1 + 2i)
## [1] "complex"
str(1 + 2i)
## cplx 1+2i
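The `L` suffix used in `2L` above asks R for an integer; without it, a number is stored as a double. The sketch below shows this, together with two base R conversion functions, as an illustration only.

```r
typeof(2)            # "double": plain numbers are stored as doubles
typeof(2L)           # "integer": the L suffix requests an integer
is.numeric(2L)       # TRUE: integers and doubles are both numeric

as.integer("42")     # convert a character string to an integer
as.character(3.14)   # convert a number to a character string
```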
In some cases the components of a vector may not be known. When an
element or value is “not available” or a “missing value” in the
statistical sense, a place within a vector may be reserved for it by
assigning it the special value NA. In general any operation
on an NA becomes an NA.
z <- c(1:3,NA)
print(z)
## [1] 1 2 3 NA
is.na(z)
## [1] FALSE FALSE FALSE TRUE
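To see how an NA propagates through a computation, and how to exclude it when that is what you want, here is a short sketch using the vector `z` from above. The `na.rm` argument exists in many base R summary functions such as `mean()` and `sum()`.

```r
z <- c(1:3, NA)

z + 1                   # the missing element stays NA
mean(z)                 # NA, because one element is missing
mean(z, na.rm = TRUE)   # 2: the NA is dropped before averaging
sum(is.na(z))           # number of missing values
```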
There is a second kind of “missing” values which are produced by
numerical computation, the so-called Not a Number, NaN,
values.
0/0
## [1] NaN
Inf - Inf
## [1] NaN
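The two kinds of missing values can be told apart with `is.na()` and `is.nan()`. A minimal sketch, where the vector `x` is made up for this example:

```r
x <- c(1, NA, 0/0)

is.na(x)    # FALSE TRUE TRUE: is.na() is TRUE for both NA and NaN
is.nan(x)   # FALSE FALSE TRUE: is.nan() is TRUE only for NaN
```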
Objects are the entities that R operates on. These can be:
Vectors, created with the c() function.
vec1 <- c(1,2,3)
str(vec1)
## num [1:3] 1 2 3
# Alternative ways of creating vectors:
vec2 <- 1:3 # Sequence of integers
vec3 <- seq(1, 3, by=1) # More general sequence
Vector elements can be accessed using [] brackets.
# Accessing elements of the vector by index (R uses 1-based indexing)
vec1[1] # First element
## [1] 1
vec2[c(2, 3)] # Elements at indices 2 and 3
## [1] 2 3
vec3[c(-2)] # All elements except for the element at index 2
## [1] 1 3
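Brackets can also be used to replace an element, and `length()` gives the number of elements. The following sketch builds on `vec1` from above.

```r
vec1 <- c(1, 2, 3)

length(vec1)         # number of elements
vec1[2] <- 20        # replace the second element
vec1
vec1[length(vec1)]   # last element
vec1[5]              # an index beyond the end gives NA, not an error
```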
Matrices, created with the matrix() function.
(mat <- matrix(c(1,2,3,4), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
str(mat)
## num [1:2, 1:2] 1 2 3 4
💡 By enclosing the assignment in parentheses (), you
not only create the object but also automatically print its value to the
console — a useful shortcut. This is equivalent to writing
print(object) or simply typing the object name (e.g.,
object), but it saves you an extra line of code.
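Coming back to the matrix `mat` created above: its elements are accessed with two indices, `[row, column]`. The sketch below also shows the `byrow` argument of `matrix()`; the name `mat_byrow` is only chosen for this example.

```r
mat <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

mat[1, 2]   # element in row 1, column 2
mat[, 1]    # the whole first column
mat[2, ]    # the whole second row

# By default a matrix is filled column by column; byrow = TRUE fills it row by row
mat_byrow <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
mat_byrow[1, 2]
```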
(array <- array(1:8, c(2,4,2)))
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
str(array)
## int [1:2, 1:4, 1:2] 1 2 3 4 5 6 7 8 1 2 ...
(list <- list(numb = 10:15, char = 'hello'))
## $numb
## [1] 10 11 12 13 14 15
##
## $char
## [1] "hello"
str(list)
## List of 2
## $ numb: int [1:6] 10 11 12 13 14 15
## $ char: chr "hello"
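List elements can be accessed by name with `$` or by name or position with `[[ ]]`. A minimal sketch; the list is called `my_list` here (an assumed name) to avoid reusing the name of the `list()` function.

```r
my_list <- list(numb = 10:15, char = "hello")

my_list$numb        # element accessed by name
my_list[["char"]]   # same idea with double brackets
my_list[[1]][2]     # second value of the first element
my_list["char"]     # single brackets return a (shorter) list, not the element itself
```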
(fac <- factor(c("single", "married")))
## [1] single married
## Levels: married single
str(fac)
## Factor w/ 2 levels "married","single": 2 1
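The output above shows that the levels are sorted alphabetically by default. If a different order is wanted, it can be given through the `levels` argument, as in this sketch (the name `fac_custom` is only an illustration).

```r
fac <- factor(c("single", "married"))
levels(fac)       # alphabetical order by default
as.integer(fac)   # the integer codes stored internally

# Set the order of the levels explicitly
fac_custom <- factor(c("single", "married"), levels = c("single", "married"))
levels(fac_custom)
table(fac_custom)   # number of observations per level
```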
(d <- data.frame(id = 1:5,
val = c(4,5,2,6,5),
group = sample(c("exp","control"), size = 5, replace = TRUE)))
str(d)
## 'data.frame': 5 obs. of 3 variables:
## $ id : int 1 2 3 4 5
## $ val : num 4 5 2 6 5
## $ group: chr "control" "exp" "exp" "control" ...
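Because `sample()` draws at random, the `group` column will differ from run to run. Calling `set.seed()` first makes the draw reproducible; the seed value 123 below is arbitrary and only chosen for this sketch. The example also shows how to access a column with `$` and filter rows with a condition.

```r
set.seed(123)   # arbitrary seed, chosen only for this illustration
d <- data.frame(id = 1:5,
                val = c(4, 5, 2, 6, 5),
                group = sample(c("exp", "control"), size = 5, replace = TRUE))

d$val            # access a column as a vector
mean(d$val)      # 4.4
d[d$val > 4, ]   # rows where val is greater than 4

# Resetting the same seed reproduces the same random draw
set.seed(123)
group_again <- sample(c("exp", "control"), size = 5, replace = TRUE)
identical(d$group, group_again)
```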
✏️ Create a data frame with 10 rows and the columns id,
blood_pressure, and group.
– id: integers from 1 to 10
– blood_pressure: random values from a normal distribution with mean 123 and standard deviation 8
– group: a factor with levels “drug 1”, “drug 2”, and “obs arm” (you decide how to assign them, e.g. by using the function sample())
After creating that data frame, add a new column
stand_score where you calculate a standardized score for
each blood_pressure value. The standardized score is similar to a
z-score, but it is calculated from the mean and standard deviation of
the blood_pressure values in the dataset (standardized
score = (x−μ)/σ).
💡 Use the function rnorm() to simulate normal values.
Use the function scale() to centre and scale a vector, or
alternatively the functions mean() and sd() to
compute mean and standard deviation of a vector. You can use
help() to learn more about how these functions work.
frac <- function(numerator, denominator) {
result <- numerator / denominator
return(result)
}
frac(6, 2) # Calling the function
## [1] 3
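A function can also check its inputs before computing. The sketch below is a hypothetical variant of `frac()`, named `safe_frac()` for this example, that stops with an error when the denominator is zero. It also relies on the fact that a function returns the value of its last evaluated expression, so `return()` is optional.

```r
safe_frac <- function(numerator, denominator) {
  # Stop early with an informative error instead of returning Inf or NaN
  if (denominator == 0) {
    stop("denominator must not be zero")
  }
  numerator / denominator   # last expression: this value is returned
}

safe_frac(6, 2)
6 / 0   # Inf: plain division by zero does not raise an error in R
```

Calling `safe_frac(1, 0)` stops with the error message given in `stop()`.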
✏️ Write a function sum_squared that takes two integers
and returns the sum of their squared values.
# Example
sum_squared(2,3)
# The output should be:
13
When you want to read a file stored somewhere on your computer (e.g. a dataset), you can either place it in your working directory and refer to it by its file name, or refer to it by its full path.
The advantage of putting the files in the same folder as your script, and setting that folder as the working directory, is that you can move the folder around on your computer without breaking your script: just set the working directory to the location of your source file every time you open it, and you’ll be fine.
# Example
setwd("~/path/to/your/folder/")
data <- read.csv("testdata.csv")
The advantage of always giving the full path to a file is that you can get data in different folders on your computer, avoiding things like copying the source data in every folder where you have a corresponding script.
# Example
data <- read.csv("~/path/to/your/folder/testdata.csv")
To find out what your current working directory is, you can use the
function getwd().
getwd()
## [1] "/home/claire/Documents/GitHub/rforphysicians/docs"
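Paths can also be assembled from their parts with `file.path()`, and `file.exists()` can check a path before reading. A minimal sketch, in which the folder and file names are the placeholders used above and the names `data_dir` and `csv_path` are assumptions for this example:

```r
# Build a path from its components (placeholder folder and file names)
data_dir <- file.path("path", "to", "your", "folder")
csv_path <- file.path(data_dir, "testdata.csv")
csv_path

# Only try to read the file if it actually exists
if (file.exists(csv_path)) {
  data <- read.csv(csv_path)
} else {
  message("File not found: ", csv_path)
}
```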
Working directory
To tell R which folder you are working in (e.g., where your data is stored), you have several options:
Use setwd("path/to/your/folder") in your script.
💡 I recommend placing both your script and your data files in the same folder, and setting that folder as your working directory. This helps avoid errors caused by R not finding your data.
getwd() # Displays the current working directory
setwd("path/to/your/folder") # Sets the working directory
We will first look at how to import a CSV file into R as a data frame.
CSV stands for Comma-Separated Values. In a .csv file,
the values are stored as plain text, separated by commas. This is a
simple and widely used format for storing tabular data.
After setting your working directory or determining the path to your
CSV file, you can use the read.csv() function to import the
data. This will create a data frame, which is one of the most commonly
used structures in R for handling datasets.
# Import a CSV file into a data frame
dataset <- read.csv("~/path/to/your/folder/data.csv")
💡 I recommend using data frames — they are generally easier to work with than matrices, especially for beginners.
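To try `read.csv()` without needing a file of your own, the following sketch writes a small made-up data frame to a temporary CSV file and reads it back. The data and the names `toy`, `tmp` and `toy_read` are invented for this illustration.

```r
# A small made-up data frame
toy <- data.frame(id = 1:3,
                  sex = c("Male", "Female", "Male"),
                  age = c(22, 44, 14))

# Write it to a temporary CSV file, then import it again
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)
toy_read <- read.csv(tmp)

str(toy_read)
```

If a file uses semicolons as separators and commas as decimal marks, which is common in many European locales, `read.csv2()` is the matching base R function.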
Another widely used data format is the Excel file
(.xlsx). For these, you can use the readxl
package to import the data:
# Load the readxl package
library(readxl)
# Read the first sheet of an Excel file
dataset <- read_excel("~/path/to/your/folder/data.xlsx")
⚠️ Note: If your file is actually a CSV but mistakenly has a .xlsx extension, you should rename it to .csv and use read.csv() instead. Mixing up formats can lead to import errors.
Let us now look at real data frames to learn how to access or modify
their elements. To do this, we will use several health data sets from
the National Health and Nutrition Examination Survey (NHANES)
from 2011-2012. The survey assessed the overall health and nutrition of
adults and children in the United States and was conducted by the
National Center for Health Statistics (NCHS). The data sets can be found
in the data_sets folder.
| Dataset | NHANES Code | Description | CSV File |
|---|---|---|---|
| Demographics | DEMO_G | Age, sex, race/ethnicity, income, education | DEMO_G.csv |
| Blood Pressure | BPX_G | Systolic/diastolic blood pressure, number of readings | BPX_G.csv |
| Body Measures | BMX_G | Height, weight, BMI, waist circumference | BMX_G.csv |
| Smoking Questionnaire | SMQ_G | Smoking habits, exposure to secondhand smoke | SMQ_G.csv |
# Load the necessary CSV files into data frames
demo <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/DEMO_G.csv") # Demographics (cycle G = 2011–2012)
bpx <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BPX_G.csv") # Blood pressure
bmx <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/BMX_G.csv") # Body measures
smq <- read.csv("/home/claire/Documents/GitHub/rforphysicians/data_sets/SMQ_G.csv") # Smoking questionnaire
# Check the structure of the data frames
str(demo)
## 'data.frame': 9756 obs. of 48 variables:
## $ SEQN : int 62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
## $ SDDSRVYR: chr "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" "NHANES 2011-2012 public release" ...
## $ RIDSTATR: chr "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" "Both interviewed and MEC examined" ...
## $ RIAGENDR: chr "Male" "Female" "Male" "Female" ...
## $ RIDAGEYR: int 22 3 14 44 14 9 0 6 21 15 ...
## $ RIDAGEMN: int NA NA NA NA NA NA 11 NA NA NA ...
## $ RIDRETH1: chr "Non-Hispanic White" "Mexican American" "Other Race - Including Multi-Racial" "Non-Hispanic White" ...
## $ RIDRETH3: chr "Non-Hispanic White" "Mexican American" "Non-Hispanic Asian" "Non-Hispanic White" ...
## $ RIDEXMON: chr "May 1 through October 31" "November 1 through April 30" "May 1 through October 31" "November 1 through April 30" ...
## $ RIDEXAGY: int NA 3 14 NA 14 10 NA 6 NA 15 ...
## $ RIDEXAGM: int NA 41 177 NA 179 120 12 81 NA 181 ...
## $ DMQMILIZ: chr "No" NA NA "Yes" ...
## $ DMQADFC : chr NA NA NA "No" ...
## $ DMDBORN4: chr "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" ...
## $ DMDCITZN: chr "Citizen by birth or naturalization" "Citizen by birth or naturalization" "Citizen by birth or naturalization" "Citizen by birth or naturalization" ...
## $ DMDYRSUS: chr NA NA NA NA ...
## $ DMDEDUC3: chr NA NA "8th grade" NA ...
## $ DMDEDUC2: chr "High school graduate/GED or equivalent" NA NA "Some college or AA degree" ...
## $ DMDMARTL: chr "Never married" NA NA "Married" ...
## $ RIDEXPRG: chr NA NA NA "The participant was not pregnant at exam" ...
## $ SIALANG : chr "English" "English" "English" "English" ...
## $ SIAPROXY: chr "Yes" "Yes" "Yes" "No" ...
## $ SIAINTRP: chr "No" "No" "No" "No" ...
## $ FIALANG : chr "English" "English" "English" "English" ...
## $ FIAPROXY: chr "No" "No" "No" "No" ...
## $ FIAINTRP: chr "No" "No" "No" "No" ...
## $ MIALANG : chr "English" NA "English" NA ...
## $ MIAPROXY: chr "No" NA "No" NA ...
## $ MIAINTRP: chr "No" NA "No" NA ...
## $ AIALANGA: chr "English" NA "English" NA ...
## $ WTINT2YR: num 102641 15458 7398 127351 12210 ...
## $ WTMEC2YR: num 104237 16116 7869 127965 13384 ...
## $ SDMVPSU : int 1 3 3 1 2 1 2 2 1 3 ...
## $ SDMVSTRA: int 91 92 90 94 90 91 92 103 92 91 ...
## $ INDHHIN2: chr "$75,000 to $99,999" "$15,000 to $19,999" "$100,000 and Over" "$45,000 to $54,999" ...
## $ INDFMIN2: chr "$75,000 to $99,999" "$15,000 to $19,999" "$100,000 and Over" "$45,000 to $54,999" ...
## $ INDFMPIR: num 3.15 0.6 4.07 1.67 0.57 NA NA 3.48 0.33 5 ...
## $ DMDHHSIZ: int 5 6 5 5 5 6 7 5 5 4 ...
## $ DMDFMSIZ: int 5 6 5 5 5 6 4 5 5 4 ...
## $ DMDHHSZA: int 0 2 0 1 1 0 3 0 0 0 ...
## $ DMDHHSZB: int 1 2 2 2 2 4 3 2 1 2 ...
## $ DMDHHSZE: int 0 0 1 0 0 0 1 1 0 0 ...
## $ DMDHRGND: chr "Female" "Female" "Male" "Male" ...
## $ DMDHRAGE: int 50 24 42 52 33 44 61 43 51 38 ...
## $ DMDHRBR4: chr "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" "Born in 50 US states or Washington, DC" ...
## $ DMDHREDU: chr "College Graduate or above" "High School Grad/GED or Equivalent" "College Graduate or above" "Some College or AA degree" ...
## $ DMDHRMAR: chr "Married" "Living with partner" "Married" "Married" ...
## $ DMDHSEDU: chr "College Graduate or above" NA "Some College or AA degree" "Some College or AA degree" ...
str(bpx)
## 'data.frame': 9338 obs. of 27 variables:
## $ SEQN : int 62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
## $ PEASCST1: chr "Complete" "Complete" "Complete" "Complete" ...
## $ PEASCTM1: int 596 64 788 527 468 583 55 98 1005 625 ...
## $ PEASCCT1: chr NA NA NA NA ...
## $ BPXCHR : int NA 100 NA NA NA NA 100 96 NA NA ...
## $ BPQ150A : chr "No" NA "Yes" "Yes" ...
## $ BPQ150B : chr "No" NA "No" "No" ...
## $ BPQ150C : chr "No" NA "No" "No" ...
## $ BPQ150D : chr "No" NA "No" "No" ...
## $ BPAARM : chr "Right" NA "Right" "Right" ...
## $ BPACSZ : chr "Large (15X32)" NA "Adult (12X22)" "Adult (12X22)" ...
## $ BPXPLS : int 82 NA 72 82 70 90 NA NA 72 62 ...
## $ BPXPULS : chr "Regular" "Regular" "Regular" "Regular" ...
## $ BPXPTY : chr "Radial" NA "Radial" "Radial" ...
## $ BPXML1 : int 130 NA 140 140 130 120 NA NA 140 140 ...
## $ BPXSY1 : int 110 NA 112 116 110 96 NA NA 124 124 ...
## $ BPXDI1 : int 82 NA 38 56 64 32 NA NA 80 82 ...
## $ BPAEN1 : chr "No" NA "No" "No" ...
## $ BPXSY2 : int 104 NA 108 118 104 94 NA NA 126 122 ...
## $ BPXDI2 : int 68 NA 36 66 72 40 NA NA 74 84 ...
## $ BPAEN2 : chr "No" NA "No" "No" ...
## $ BPXSY3 : int 118 NA 106 120 106 94 NA NA 124 128 ...
## $ BPXDI3 : int 74 NA 38 58 78 0 NA NA 80 82 ...
## $ BPAEN3 : chr "No" NA "No" "No" ...
## $ BPXSY4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPXDI4 : int NA NA NA NA NA NA NA NA NA NA ...
## $ BPAEN4 : chr NA NA NA NA ...
str(bmx)
## 'data.frame': 9338 obs. of 26 variables:
## $ SEQN : int 62161 62162 62163 62164 62165 62166 62167 62168 62169 62170 ...
## $ BMDSTATS: int 1 1 1 1 1 1 1 1 1 1 ...
## $ BMXWT : num 69.2 12.7 49.4 67.2 69.1 28.8 10.8 23.6 54.6 63.5 ...
## $ BMIWT : int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXRECUM: num NA 95.7 NA NA NA NA 79.5 NA NA NA ...
## $ BMIRECUM: int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXHEAD : num NA NA NA NA NA NA NA NA NA NA ...
## $ BMIHEAD : logi NA NA NA NA NA NA ...
## $ BMXHT : num 172.3 94.7 168.9 170.1 159.4 ...
## $ BMIHT : int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXBMI : num 23.3 14.2 17.3 23.2 27.2 16.2 NA 15.4 20.1 18.2 ...
## $ BMDBMIC : int NA 2 2 NA 3 2 NA 2 NA 2 ...
## $ BMXLEG : num 40.2 NA 40.3 40.5 42.1 31 NA NA 38.7 43.3 ...
## $ BMILEG : int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXARML : num 35 18.5 36.3 37.2 35.2 28 16.2 24.8 33.4 37.5 ...
## $ BMIARML : int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXARMC : num 32.5 16.6 22 29.3 29.7 19.1 15.5 17.1 28.5 25.8 ...
## $ BMIARMC : int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXWAIST: num 81 45.4 64.6 80.1 86.7 59.8 NA 54.4 69.6 69.4 ...
## $ BMIWAIST: int NA NA NA NA NA NA NA NA NA NA ...
## $ BMXSAD1 : num 17.7 NA 15.6 18.3 21 13.5 NA NA 16.4 14.8 ...
## $ BMXSAD2 : num 17.9 NA 15.5 18.5 20.8 13.5 NA NA 16.3 14.7 ...
## $ BMXSAD3 : num NA NA NA NA NA NA NA NA NA NA ...
## $ BMXSAD4 : num NA NA NA NA NA NA NA NA NA NA ...
## $ BMDAVSAD: num 17.8 NA 15.6 18.4 20.9 13.5 NA NA 16.4 14.8 ...
## $ BMDSADCM: int NA NA NA NA NA NA NA NA NA NA ...
str(smq)
## 'data.frame': 6790 obs. of 30 variables:
## $ SEQN : int 62161 62163 62164 62165 62169 62170 62171 62172 62174 62176 ...
## $ SMQ020 : chr "No" NA "No" NA ...
## $ SMD030 : int NA NA NA NA NA NA NA 28 NA NA ...
## $ SMQ040 : chr NA NA NA NA ...
## $ SMQ050Q : int NA NA NA NA NA NA NA NA NA NA ...
## $ SMQ050U : chr NA NA NA NA ...
## $ SMD055 : int NA NA NA NA NA NA NA NA NA NA ...
## $ SMD057 : int NA NA NA NA NA NA NA NA NA NA ...
## $ SMQ077 : chr NA NA NA NA ...
## $ SMD641 : int NA NA NA NA NA NA NA 30 NA NA ...
## $ SMD650 : int NA NA NA NA NA NA NA 10 NA NA ...
## $ SMD093 : chr NA NA NA NA ...
## $ SMDUPCA : chr "" "" "" "" ...
## $ SMD100BR: chr "" "" "" "" ...
## $ SMD100FL: chr NA NA NA NA ...
## $ SMD100MN: chr NA NA NA NA ...
## $ SMD100LN: chr NA NA NA NA ...
## $ SMD100TR: int NA NA NA NA NA NA NA 6 NA NA ...
## $ SMD100NI: num NA NA NA NA NA NA NA 0.6 NA NA ...
## $ SMD100CO: int NA NA NA NA NA NA NA 6 NA NA ...
## $ SMQ621 : chr NA "I have never smoked, not even a puff" NA "I have never smoked, not even a puff" ...
## $ SMD630 : int NA NA NA NA NA NA NA NA NA NA ...
## $ SMQ660 : chr NA NA NA NA ...
## $ SMQ664M : chr NA NA NA NA ...
## $ SMQ664C : chr NA NA NA NA ...
## $ SMQ664W : chr NA NA NA NA ...
## $ SMQ664B : logi NA NA NA NA NA NA ...
## $ SMQ664O : chr NA NA NA NA ...
## $ SMQ670 : chr NA NA NA NA ...
## $ SMAQUEX2: chr "Home Interview (20+ Yrs)" "A-CASI (12 - 19 Yrs)" "Home Interview (20+ Yrs)" "A-CASI (12 - 19 Yrs)" ...
✏️ Exercise on the NHANES data sets n°1: import the
demo, bpx, bmx and
smq data sets from the data_sets folder
into R.
💡 The codebook for each dataset can be accessed either on the NCHS website
or directly in R using the function
nhanesCodebook(nh_table, colname) from the package
nhanesA (which I used to download the data). You’ll find
more details about installing packages at the end of this chapter.
Being able to access elements in a data frame is essential when working with data. Here are some common methods to select specific elements, rows, or columns.
# Look at the first and last few rows
head(demo)
tail(demo)
# Select columns by name
demo[, c("RIDAGEYR", "RIAGENDR")] # Selecting age in years and gender
vars <- c("RIDAGEYR", "RIAGENDR")
demo[, vars] # Alternative using variable `vars`
# Select elements by position
demo[1, 1] # Access the first element of the first column (the respondent sequence number of the 1st participant)
## [1] 62161
ind_mat <- cbind(c(1, 3, 5), c(2, 4, 6))
demo[ind_mat] # Access individual cells given as (row, column) pairs: here (1, 2), (3, 4) and (5, 6)
## [1] "NHANES 2011-2012 public release" "Male"
## [3] NA
# Select rows based on a condition
head(demo[, "RIDAGEYR"] > 50) # Logical condition for age greater than 50
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(!(demo[, "DMDHHSIZ"] > 3)) # Logical negation for total number of people in the household not greater than 3
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
demo[demo[, "RIDAGEYR"] > 50, ] # Rows where age > 50
demo[demo[, "DMDHHSIZ"] < 3, ] # Rows where the total number of people in the household is less than 3
demo[demo[, "DMDHHSIZ"] >= 3, ] # Rows where the total number of people in the household is greater than or equal to 3
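Two details are worth knowing when selecting rows with a condition: a missing value in the condition produces a row of NAs in the result, and several conditions can be combined with `&` (and) or `|` (or). The sketch below uses a small made-up data frame, `demo_toy`, that only borrows the column names from `demo`; its values are invented for this illustration.

```r
# Made-up toy data with the same column names as in `demo`
demo_toy <- data.frame(
  SEQN     = 1:6,
  RIDAGEYR = c(22, 3, 14, 44, 67, NA),
  DMDHHSIZ = c(5, 6, 2, 5, 1, 3)
)

# The NA in RIDAGEYR yields an extra row filled with NA
over_50 <- demo_toy[demo_toy$RIDAGEYR > 50, ]
nrow(over_50)

# which() keeps only the positions where the condition is TRUE
over_50_clean <- demo_toy[which(demo_toy$RIDAGEYR > 50), ]
nrow(over_50_clean)

# Combine two conditions with & (both must be TRUE)
adults_large <- demo_toy[which(demo_toy$RIDAGEYR >= 18 & demo_toy$DMDHHSIZ >= 3), ]
adults_large$SEQN
```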
1.4 Comments
Comments can be added to the code in a script using the hash symbol
#. It is very, very important that you always comment every piece of your code, to make sure:
So, for scientific purposes, please comment your code!
Here’s an example of how I usually comment the scripts I use in my daily work:
💡 Use ---- after numbered headers in comments to make your code more navigable and readable in long scripts (this is a common R style convention).
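As a hypothetical sketch of this commenting style (not the author's own script), the short example below uses numbered headers ending in `----` and a comment above each step. In RStudio, comment lines ending in four or more dashes are treated as code sections that appear in the document outline and can be folded. The blood pressure values are made up.

```r
# 1. Set-up ----
# Toy systolic blood pressure readings (made up for this illustration)
bp <- c(118, 124, 131, NA, 109)

# 2. Data cleaning ----
# Remove missing readings before computing summaries
bp_clean <- bp[!is.na(bp)]

# 3. Analysis ----
# Mean of the remaining readings
bp_mean <- mean(bp_clean)
bp_mean
```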